IAGLR 2021 H2O Workshop (bit.ly/xxxxxxxxxx)
1 Agenda
- 09:00 to 09:30 Set Up & Introduction
- 09:30 to 10:30 Regression Example
- 10:30 to 11:00 Coffee Break
- 11:00 to 11:30 Classification Example
- 11:30 to 12:30 Bring Your Own Data + Q&A
2 Set Up
2.1 Download -> bit.ly/useR2019_h2o_tutorial
- setup.R: installs the required packages
- tutorial.Rmd: the main RMarkdown file with code
- tutorial.html: this webpage
- GitHub Repo: https://github.com/woobe/IAGLR_2021_H2O_Workshop
2.2 R Packages
- Check out setup.R
- For this tutorial:
  - h2o for automatic and explainable machine learning
- For RMarkdown:
  - knitr for rendering this RMarkdown
  - rmdformats for the readthedown RMarkdown template
  - DT for nice tables
3 Introduction
General Data Protection Regulation (GDPR) is now in place. Are you ready to explain your models? This is a hands-on tutorial for R beginners. I will demonstrate the use of H2O and other R packages for automatic and interpretable machine learning. Participants will be able to follow and build regression and classification models quickly with H2O’s AutoML. They will then be able to explain the model outcomes with various methods.
It is a workshop for R beginners and anyone interested in machine learning. RMarkdown and the rendered HTML will be provided so everyone can follow without running the code.
4 H2O Basics
# Let's go
library(h2o) # for H2O Machine Learning

4.1 Start a local H2O Cluster (JVM)

h2o.init()

Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 hours 37 minutes
H2O cluster timezone: Europe/London
H2O data parsing timezone: UTC
H2O cluster version: 3.32.1.2
H2O cluster version age: 14 days
H2O cluster name: H2O_started_from_R_joe_sid571
H2O cluster total nodes: 1
H2O cluster total memory: 2.29 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.0.5 (2021-03-31)
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll() # Optional: remove anything from previous session

# Enter your lucky seed here ...
n_seed <- 12345

5 Data - Dom Lake Huron Abund

lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')

datatable(as.data.frame(head(lake_data)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

h2o.describe(lake_data)

  Label  Type Missing Zeros PosInf NegInf      Min        Max      Mean
1 C1 int 0 0 0 0 1.0000 885.00000 443.00000
2 Year int 0 0 0 0 2006.0000 2012.00000 2009.00113
3 Region enum 0 96 0 0 0.0000 3.00000 NA
4 Station enum 0 24 0 0 0.0000 122.00000 NA
5 Replicate int 0 1 0 0 0.0000 3.00000 1.99887
6 lat real 0 0 0 0 43.2695 46.23333 44.68487
Sigma Cardinality
1 255.6217909 NA
2 2.3727809 NA
3 NA 4
4 NA 123
5 0.8190319 NA
6 0.8554837 NA
[ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)

h2o.hist(lake_data$DBUG, breaks = 100)

5.1 Define Target and Features
target_DPOL <- "DPOL"
target_DBUG <- "DBUG"
# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))
print(features)

 [1] "Year"                  "Region"                "Station"
[4] "Replicate" "lat" "lon"
[7] "depth" "substrate" "Season"
[10] "Na2O" "Magnesium" "MagnesiumOxide"
[13] "AluminumOxide" "SiliconDioxide" "Quartz"
[16] "PhosphorusPentoxide" "Sulphur" "DDT_TOTAL"
[19] "P.P.TDE" "P.P.DDE" "P.P.DDE.1"
[22] "HeptachlorEpoxide" "Potassium" "PotassiumOxide"
[25] "Calcium" "CalciumOxide" "TitaniumDioxide"
[28] "Chromium" "Manganese" "ManganeseOxide"
[31] "XTR.Iron" "Iron" "IronOxide"
[34] "Cobalt" "Nickel" "Copper"
[37] "Zinc" "Selenium" "Strontium"
[40] "Beryllium" "Silver" "Cadmium"
[43] "Tin" "TotalCarbon" "OrganicCarbon"
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen" "Mercury"
[49] "Lead" "Uranium" "Vanadium"
[52] "Arsenic" "Chloride" "Fluoride"
[55] "Sand" "Silt" "Clay"
[58] "Mean_grainsize" "simpsonD" "shannon"
[61] "Chironomidae" "Oligochaeta" "Sphaeriidae"
[64] "Diporeia"
5.2 Split Data into Train/Test
h_split <- h2o.splitFrame(lake_data, ratios = 0.8, seed = n_seed)
h_train <- h_split[[1]] # 80% for modelling
h_test <- h_split[[2]] # 20% for evaluation

dim(h_train)
[1] 710  68

dim(h_test)
[1] 175  68
5.3 Cross-Validation
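There is no separate validation set in this tutorial. Instead, every H2O training call below uses k-fold cross-validation via the nfolds argument. A minimal sketch of the pattern, using the h_train frame and the variables defined above:

```r
# 5-fold CV: H2O trains 5 additional models, each holding out one fold,
# and reports metrics on the combined holdout predictions
model_cv <- h2o.gbm(x = features,
                    y = target_DPOL,
                    training_frame = h_train,
                    nfolds = 5,
                    seed = n_seed) # seed fixes the fold assignment

# Cross-validated RMSE
h2o.rmse(model_cv@model$cross_validation_metrics)
```

Because the seed is fixed, re-running the chunk reproduces the same fold assignment and the same CV metrics.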
6 Regression Example - Target “DPOL”
6.0.1 Baseline Models
- h2o.glm(): H2O Generalized Linear Model
- h2o.randomForest(): H2O Random Forest Model
- h2o.gbm(): H2O Gradient Boosting Model
- h2o.deeplearning(): H2O Deep Neural Network Model
- h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC
Let’s start with GBM
model_gbm_DPOL <- h2o.gbm(x = features,          # All 64 features
                          y = target_DPOL,       # DPOL (our regression target)
training_frame = h_train, # H2O dataframe with training data
model_id = "baseline_gbm", # Give the model a name
nfolds = 5, # Using 5-fold CV
                          seed = n_seed)         # Your lucky seed

# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics

H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 5359.94
RMSE: 73.21161
MAE: 16.44797
RMSLE: NaN
Mean Residual Deviance : 5359.94
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)

H2ORegressionMetrics: gbm
MSE: 1357.527
RMSE: 36.84464
MAE: 10.64912
RMSLE: NaN
Mean Residual Deviance : 1357.527
Let’s use RMSE as our main evaluation metric.
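As a reminder, RMSE is the square root of the mean squared difference between observed and predicted values, so it penalizes large errors more heavily than MAE does. In plain R:

```r
# Root Mean Squared Error for two numeric vectors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(1, 2, 3), c(1, 2, 5)) # -> 1.154701
```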
6.0.2 Build Other Baseline Models (GLM, DRF, GBM, DNN) - TRY IT YOURSELF!
# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)

6.1 Manual Tuning
6.1.1 Check out the hyper-parameters for each algo
?h2o.glm
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost

6.1.2 Train a GBM model with manual settings
model_gbm_DPOL_m <- h2o.gbm(x = features,
                            y = target_DPOL,
                            training_frame = h_train,
                            model_id = "model_gbm_m",
                            nfolds = 5,
                            seed = n_seed,
                            # Manual Settings based on experience
                            learn_rate = 0.1,       # use a lower rate (more conservative)
                            ntrees = 100,           # use more trees (due to lower learn_rate)
                            sample_rate = 0.9,      # use random n% of samples for each tree
                            col_sample_rate = 0.9)  # use random n% of features for each tree

6.1.3 Comparison (RMSE: Lower = Better)
# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
"GBM: Gradient Boosting Model (Manual Settings)"),
stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA

# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))

# Show Comparison (RMSE: Lower = Better)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.2 Hierarchical GLM (HGLM) with Random Effects
# Not working yet ...
# model_hglm_DPOL <- h2o.glm(x = features,
#                            y = target_DPOL,
#                            training_frame = h_train,
#                            model_id = "model_hglm",
#                            seed = n_seed,
#                            family = "gaussian",
#                            rand_family = c("gaussian"),
#                            rand_link = c("identity"),
#                            HGLM = TRUE,
#                            random_columns = "Year",
#                            calc_like = TRUE)

6.3 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
                          y = target_DPOL,
                          training_frame = h_train,
                          nfolds = 5,                   # 5-fold Cross-Validation
                          max_models = 20,              # Max number of models
                          stopping_metric = "RMSE",     # Metric to optimize
                          project_name = "automl_DPOL", # Specify a name so you can add more models later
                          seed = n_seed)

6.3.1 Leaderboard
datatable(as.data.frame(automl_DPOL@leaderboard),
          rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.3.2 Best Model (Leader)
automl_DPOL@leader

Model Details:
==============
H2ORegressionModel: deeplearning
Model ID: DeepLearning_grid__3_AutoML_20210513_171638_model_2
Status of Neuron Layers: predicting DPOL, regression, gaussian distribution, Quadratic loss, 5,101 weights/biases, 84.5 KB, 1,185,700 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms
1 1 211 Input 10.00 % NA NA NA NA
2 2 20 RectifierDropout 10.00 % 0.000000 0.000000 0.111998 0.200869
3 3 20 RectifierDropout 10.00 % 0.000000 0.000000 0.007790 0.013263
4 4 20 RectifierDropout 10.00 % 0.000000 0.000000 0.011148 0.014293
5 5 1 Linear NA 0.000000 0.000000 0.002238 0.001988
momentum mean_weight weight_rms mean_bias bias_rms
1 NA NA NA NA NA
2 0.000000 0.009646 0.278547 0.439409 1.247423
3 0.000000 -0.198634 0.412209 0.579820 0.792267
4 0.000000 -0.187279 0.642276 1.320260 0.579371
5 0.000000 0.000325 0.354135 0.696467 0.000000
H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **
MSE: 2304.368
RMSE: 48.00383
MAE: 21.53961
RMSLE: 2.67551
Mean Residual Deviance : 2304.368
H2ORegressionMetrics: deeplearning
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4209.012
RMSE: 64.8769
MAE: 19.68972
RMSLE: NaN
Mean Residual Deviance : 4209.012
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid
mae 19.689724 4.1271563 17.920628 15.404459 18.325184
mean_residual_deviance 4209.0127 1434.702 3613.0261 2488.5093 3668.7349
mse 4209.0127 1434.702 3613.0261 2488.5093 3668.7349
r2 0.5286015 0.17153801 0.5485696 0.49927586 0.72438794
residual_deviance 4209.0127 1434.702 3613.0261 2488.5093 3668.7349
rmse 64.11342 11.095151 60.10845 49.88496 60.570084
rmsle 2.0851088 0.0 2.0851088 NaN NaN
cv_4_valid cv_5_valid
mae 20.461622 26.33673
mean_residual_deviance 6159.7417 5115.051
mse 6159.7417 5115.051
r2 0.60964584 0.26112828
residual_deviance 6159.7417 5115.051
rmse 78.484024 71.519585
rmsle NaN NaN
6.3.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.4 Make Predictions
yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)
head(yhat_test)

   predict
1 19.14614
2 19.12452
3 13.77955
4 14.24310
5 13.70907
6 13.81577
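To inspect the predictions against the observed values, one option is to bind them back to the test frame before pulling the result into R. A sketch, assuming the h_test and yhat_test objects from above:

```r
# Combine observed DPOL and predictions column-wise, then bring into R
d_pred <- as.data.frame(h2o.cbind(h_test[, target_DPOL], yhat_test))

# Quick visual check: observed vs predicted
plot(d_pred$DPOL, d_pred$predict,
     xlab = "Observed DPOL", ylab = "Predicted DPOL")
```

A good model places most points close to the diagonal; large vertical gaps flag stations the model predicts poorly.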
6.5 Explainable AI - Target “DPOL”
6.5.1 Global Explanations
# Explain Manually Tuned GBM
h2o.explain(model_gbm_DPOL_m, newdata = h_test)
Residual Analysis
=================
> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
============
> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectations
===================================
> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
6.5.2 Local Explanations
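h2o.explain() has a row-level counterpart, h2o.explain_row(), which produces the same style of explanations (e.g. SHAP contributions, ICE) for a single observation. For example, for the first row of the test set:

```r
# Local explanations for one instance (row 1 of h_test)
h2o.explain_row(model_gbm_DPOL_m, newdata = h_test, row_index = 1)
```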
6.5.3 Local Contributions (for Tree-based Models)
predictions <- h2o.predict(model_gbm_DPOL_m, newdata = h_test)
as.data.frame(head(predictions, 5))

    predict
1 33.233473
2 7.628626
3 13.034093
4 6.006849
5 14.779455
contributions <- h2o.predict_contributions(model_gbm_DPOL_m, newdata = h_test)
datatable(as.data.frame(head(contributions)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

7 Regression Example - Target “DBUG”
T.B.A.
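Until this section is filled in, the AutoML workflow from Section 6 can be reused by swapping the target. A sketch:

```r
# Same recipe as for DPOL, now with DBUG as the target
automl_DBUG <- h2o.automl(x = features,
                          y = target_DBUG,
                          training_frame = h_train,
                          nfolds = 5,
                          max_models = 20,
                          stopping_metric = "RMSE",
                          project_name = "automl_DBUG",
                          seed = n_seed)
automl_DBUG@leaderboard
```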
8 Bring Your Own Data + Q&A
Get your hands dirty!